Logistic Regression

Logistic regression is a parametric approach to classification. It models the probability as $p(X) = \frac{e^{\beta_0 + \beta X}}{1 + e^{\beta_0 + \beta X}}$, where $\beta_0$ is the intercept and $\beta$ is the coefficient vector. The ratio $\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta X}$ is the odds.

  • Each $\beta_j$ represents the change in the log-odds for a one-unit increase in $X_j$.
  • The conditional probability is $\mathbb{P}(Y = 1 \mid X = x) = \frac{e^{\beta_0 + x^T\beta}}{1 + e^{\beta_0 + x^T\beta}}$.
  • The log-odds is the logit: $\log\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \log\left(\frac{p_1(x)}{p_0(x)}\right) = \beta_0 + x^T\beta$.
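
To make the model concrete, here is a minimal numerical sketch of the probability, log-odds, and odds; the values of $\beta_0$, $\beta$, and $x$ below are made up for illustration, not fitted estimates.

```python
import numpy as np

beta_0 = -1.0                 # intercept (illustrative value)
beta = np.array([0.5, 2.0])   # coefficient vector (illustrative values)
x = np.array([1.0, 0.3])      # one observation

log_odds = beta_0 + x @ beta                   # logit: beta_0 + x^T beta
p = np.exp(log_odds) / (1 + np.exp(log_odds))  # P(Y = 1 | X = x)

print(log_odds)      # log-odds
print(p)             # probability in (0, 1)
print(p / (1 - p))   # odds, equal to e^{log_odds}
```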

We can use maximum likelihood to estimate the parameters. The likelihood function is $L(\beta_0, \beta) = \prod_{i=1}^n p(X_i)^{Y_i} (1 - p(X_i))^{1 - Y_i}$, where $Y_i$ is the binary response.

The log-likelihood is $\ell(\beta_0, \beta) = \sum_{i=1}^n \left[ Y_i \log p(X_i) + (1 - Y_i)\log(1 - p(X_i)) \right] = \sum_{i=1}^n \left[ Y_i \log\frac{p(X_i)}{1 - p(X_i)} + \log(1 - p(X_i)) \right] = \sum_{i=1}^n \left[ Y_i(\beta_0 + x_i^T\beta) + \log\left(1 - \frac{e^{\beta_0 + x_i^T\beta}}{1 + e^{\beta_0 + x_i^T\beta}}\right) \right] = \sum_{i=1}^n \left[ Y_i(\beta_0 + x_i^T\beta) - \log(1 + e^{\beta_0 + x_i^T\beta}) \right]$.
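
As a sketch, the last form of the log-likelihood can be computed directly; `X`, `y`, `beta_0`, and `beta` below are assumed toy inputs.

```python
import numpy as np

def log_likelihood(beta_0, beta, X, y):
    # l(beta_0, beta) = sum_i [ y_i (beta_0 + x_i^T beta) - log(1 + e^{beta_0 + x_i^T beta}) ]
    eta = beta_0 + X @ beta                      # linear predictor for each observation
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Toy data, made up for illustration
X = np.array([[0.5, 1.2], [-0.3, 0.8], [1.5, -0.7]])
y = np.array([1, 0, 1])
print(log_likelihood(-0.5, np.array([1.0, 0.2]), X, y))
```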

We use the Z-statistic to assess the statistical significance of the MLE, where $Z = \frac{\hat\beta_j}{SE[\hat\beta_j]}$.
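
One way to obtain these Z-statistics in practice is with a package such as statsmodels; this is an illustrative sketch on simulated data, not part of the original notes.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data, made up for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0]))))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
z = fit.params / fit.bse   # Z = beta_hat_j / SE[beta_hat_j]
print(z)                   # matches the z-values reported by fit.summary()
```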

We define the logistic loss as the negative log-likelihood of the model. Since the negative log-likelihood is convex, we can use gradient descent to find the optimal solution; this negative log-likelihood is what we call the loss function.

Gradient Descent

Assume $\beta_0 = 0$; then we have the following.

The log-likelihood is concave, so we minimize the negative log-likelihood $-\ell(\beta) = \sum_{i=1}^n \left[ -y_i x_i^T\beta + \log(1 + e^{x_i^T\beta}) \right]$ with gradient descent.

The partial derivative at any $\beta$ is $\frac{\partial(-\ell(\beta))}{\partial \beta_j} = \sum_{i=1}^n \left[ -y_i + \frac{e^{x_i^T\beta}}{1 + e^{x_i^T\beta}} \right] x_{ij}$, so the update is $\hat\beta^{(k+1)} = \hat\beta^{(k)} - \alpha \sum_{i=1}^n \left[ -y_i + \frac{e^{x_i^T\beta^{(k)}}}{1 + e^{x_i^T\beta^{(k)}}} \right] x_i$, where $\alpha$ is the learning rate. We iterate until one of the stopping criteria below is met (a minimal implementation sketch follows the list):

  • $|\ell(\hat\beta^{(k+1)}) - \ell(\hat\beta^{(k)})|$ is small enough to stop (e.g. $\le 10^{-6}$)
  • $\|\hat\beta^{(k+1)} - \hat\beta^{(k)}\|_2$ is small, or $\|\hat\beta^{(k+1)} - \hat\beta^{(k)}\|_2 / \|\hat\beta^{(k)}\|_2$ is small
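
Putting the update rule and a stopping criterion together, here is a minimal gradient-descent sketch under the assumption $\beta_0 = 0$; the data are simulated for illustration, and the stopping rule used is the one based on $\|\hat\beta^{(k+1)} - \hat\beta^{(k)}\|_2$.

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def logistic_gd(X, y, alpha=0.001, tol=1e-6, max_iter=10000):
    # Gradient descent on the negative log-likelihood, assuming beta_0 = 0.
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Gradient: sum_i [ -y_i + e^{x_i^T beta} / (1 + e^{x_i^T beta}) ] x_i
        grad = X.T @ (sigmoid(X @ beta) - y)
        beta_new = beta - alpha * grad
        if np.linalg.norm(beta_new - beta) < tol:   # ||beta^(k+1) - beta^(k)||_2 small
            return beta_new
        beta = beta_new
    return beta

# Simulated data, made up for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, sigmoid(X @ np.array([1.0, -2.0])))
print(logistic_gd(X, y))   # should be roughly (1, -2)
```

Monitoring $|\ell(\hat\beta^{(k+1)}) - \ell(\hat\beta^{(k)})|$ instead would work the same way; only the quantity checked against the tolerance changes.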